BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION
Abstract
A method for generating a machine translation (MT) dictionary from parallel texts is described. This method utilizes both statistical information and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by a linguistic-based method can be extracted. Over 70% accurate translations of compound nouns and over 50% of unknown words are obtained as the first candidate from small Japanese/English parallel texts containing severe distortions.

1 INTRODUCTION

Parallel texts (corpora) are useful resources for acquiring a variety of linguistic knowledge (Dangan, 1991; Matsumoto, 1993), especially for machine translation systems which inherently require customizations. Translation dictionaries are, needless to say, the most basic and powerful knowledge source for improving and customizing translation systems. Our research interest lies in automatic generation of translation dictionaries from parallel texts. In this perspective, finding corresponding words or phrases in bilingual texts will be the fundamental factor for accurate translation.

Statistics-based processing has proven to be very powerful for aligning sentences and words in parallel corpora (Brown, 1991; Gale, 1993; Chen, 1993). Kupiec proposes an algorithm for finding noun phrases in bilingual corpora (Kupiec, 1993). In this algorithm, noun-phrase candidates are extracted from tagged and aligned parallel texts using a noun phrase recognizer and the correspondences of these noun phrases are calculated based on the EM algorithm. Accuracy of around 90% has been attained for the hundred highest ranking correspondences. Statistics-based processing is effective when a relatively large amount of parallel texts is available, i.e. when high frequencies are obtained.

On the other hand, existing linguistic knowledge can be used for finding corresponding words or phrases in parallel texts.
For example, possible target expressions for a source expression provided by a translation system (linguistic knowledge source) can be a key in searching for the corresponding expressions in a corpus (Nogami, 1991; Katoh, 1993). Yamamoto (1993) proposes a method for generating a translation dictionary from Japanese/English parallel texts. In this method, English and Japanese compound noun phrases are extracted from parallel texts and their correspondences are searched by matching their possible translations generated by the existing translation dictionary. However, acquirable noun phrases are limited by the linguistic generative power of the translation dictionary. Furthermore, this method utilizes no sentence alignment information, which can reduce errors in finding noun phrase correspondences.

This paper proposes a new method for generating an MT dictionary from parallel texts. It utilizes both statistical and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by the above linguistic-based method can be extracted, and a highly accurate translation dictionary is generated from relatively small parallel texts.

2 APPROACH TO BUILDING AN MT DICTIONARY

Our goal in building an MT dictionary from parallel texts is to develop a robust method which enables highly accurate extraction of translation pairs from a relatively small amount of parallel texts as well as from parallel texts containing severe distortions. In real-world applications, it is generally extremely difficult, especially for MT users, to obtain a large amount of high quality parallel texts of one specific domain. If source and target languages do not belong to the same linguistic family, like Japanese and English, the situation becomes grave.
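The idea of combining statistical and linguistic evidence can be illustrated with a minimal sketch. This is not the authors' algorithm, whose scoring is not given in this excerpt. It assumes, purely for illustration, a Dice coefficient over aligned sentence pairs as the statistical score, a word-overlap score against a seed dictionary as the linguistic score, and a simple weighted sum to combine them. All function names, the `weight` parameter, and the romanized example terms are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Statistical evidence (assumed): Dice coefficient of sentence-level co-occurrence."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned_pairs:
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        pair_freq.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


def linguistic_score(src_term, tgt_term, seed_dict):
    """Linguistic evidence (assumed): share of target words the seed dictionary predicts."""
    predicted = {w for part in src_term.split() for w in seed_dict.get(part, ())}
    tgt_words = tgt_term.split()
    if not tgt_words:
        return 0.0
    return sum(w in predicted for w in tgt_words) / len(tgt_words)


def rank_candidates(aligned_pairs, seed_dict, weight=0.5):
    """Rank target candidates per source term by a weighted sum of both scores."""
    ranked = {}
    for (s, t), dice in dice_scores(aligned_pairs).items():
        combined = weight * dice + (1.0 - weight) * linguistic_score(s, t, seed_dict)
        ranked.setdefault(s, []).append((t, combined))
    for candidates in ranked.values():
        # Highest score first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical toy data: term lists extracted from three aligned sentence pairs.
aligned_pairs = [
    (["handotai sochi", "kiban"], ["semiconductor device", "substrate"]),
    (["handotai sochi"], ["semiconductor device"]),
    (["kiban"], ["substrate"]),
]
# "kiban" is deliberately missing from the seed dictionary (an unknown word).
seed_dict = {"handotai": ["semiconductor"], "sochi": ["device", "apparatus"]}

ranked = rank_candidates(aligned_pairs, seed_dict)
for source_term, candidates in ranked.items():
    print(source_term, "->", candidates[0][0])
```

In this toy setting the unknown word "kiban" has no dictionary support, so a purely linguistic matcher could not pair it with anything, while the co-occurrence score still ranks "substrate" first. The compound "handotai sochi" is supported by both kinds of evidence.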
As one typical example of MT dictionary compilation, we have selected Japanese and English patent documents which contain many state-of-the-art technical terms. Although these documents are not cul-
Publication date: 2002